Skip to content
This repository has been archived by the owner on May 23, 2022. It is now read-only.

Refactor codebase #36

Merged
merged 15 commits into from
Apr 15, 2020
Merged

Refactor codebase #36

merged 15 commits into from
Apr 15, 2020

Conversation

juliohm
Copy link
Member

@juliohm juliohm commented Apr 14, 2020

This PR addresses issue #35. It refactors the codebase so that we have a single concept per source file. It also disables name exports, which are now exported by downstream components (e.g. LossFunctions.jl).

@juliohm juliohm requested a review from joshday April 14, 2020 13:30
Project.toml Outdated
@@ -3,11 +3,9 @@ uuid = "7f8f8fb0-2700-5f03-b4bd-41f8cfc144b6"
version = "0.3.0"

[deps]
LinearAlgebra = "37e2e46d-f89d-539d-b4ee-838fcccc9c8e"
StatsBase = "2913bbd2-ae8a-5f71-8c99-4fb6c76f3a91"
Markdown = "d6f4376e-aef5-505a-96c1-9c027394607a"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where is Markdown used?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is used in the docstrings for the LaTeX rendering. The code doesn't compile otherwise.

Copy link
Member

@johnnychen94 johnnychen94 Apr 14, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

instead of @doc doc"""...""" I believe @doc raw"""...""" does the same thing without the need of the package Markdown.

Ref: https://docs.julialang.org/en/v1/manual/documentation/#and-\\-characters-1

@joshday
Copy link
Member

joshday commented Apr 14, 2020

In regards to the list in the mentioned issue:

  1. I don't see the need for splitting LearnBase.jl into multiple files. I'd much rather have everything in one 480-line (after removing exports) file.

  2. 👍 from me


Also, your most significant change is dropping StatsBase, which again is 👍 from me.

@juliohm
Copy link
Member Author

juliohm commented Apr 14, 2020

Thank you for your time reviewing this. I think that splitting this single source file into smaller digestible files is quite beneficial in the long run, particularly because I am planning to extend the interface with other concepts in the near future. I spent a lot of time trying to understand which functions in the interface were related to costs and which were not, and it would be nice if new potential contributors could just jump directly to the concept they care about like for example costs.jl. Could you please reconsider the splitting? What are the major downsides you are seeing?

@johnnychen94
Copy link
Member

A minor version is needed since it's a breaking change by un-exporting the symbols, perhaps 0.4.0-DEV.

@juliohm
Copy link
Member Author

juliohm commented Apr 14, 2020

Thank you @johnnychen94 , we have that in mind. I am will keep improving the PR and then after it is merged I will bump the minor version for a new release.

@joshday I've started adding back the docstrings for value, deriv etc. The file costs.jl by itself already has 370 lines of code. It is a lot to digest in a single file. I will stick with this splitting of source idea, and we can reconsider it in the future. For now, it is very hard to keep track of things on the same file.

test/costs.jl Outdated Show resolved Hide resolved
@juliohm
Copy link
Member Author

juliohm commented Apr 14, 2020

Another point of discussion is the value_deriv wrapper function. In my opinion, it feels like premature optimization of the interface. It adds complexity and it is not being used by any of the direct dependencies of LearnBase.jl. Can I suggest to remove it from the interface for now? That would simplify a lot of code repetition and dispatch complexity moving forward.

@juliohm
Copy link
Member Author

juliohm commented Apr 14, 2020

Ok, I've finished this PR, and the interface is already much easier to follow. I'd would like to summarize the changes here before I go ahead and merge it tomorrow morning (BRT time zone):

Organization

  • The source is now organized by major concepts, costs.jl, aggmode.jl, obsdim.jl, other.jl which are included in the main interface file LearnBase.jl. This is much more digestible and will facilitate future extensions (the file costs.jl alone has 520 lines, and will likely grow with the addition of probabilistic functional losses at some point).

  • The aggregation modes previously defined inside LossFunctions.jl are now part of the interface here. I've copied the module so that users will still need to qualify the names as before (e.g. AggMode.Mean()).

  • Similarly, the observation dimension conventions have their own module moved to obsdim.jl.

  • Added a new abstract type UnarySupervisedLoss to model both MarginLoss and DistanceLoss, or any loss that can be written as a composition L(y,yhat) = L(g(y,yhat)). This will clean up some code in LossFunctions.jl that is currently looping over MarginLoss and DistanceLoss with macros.

  • The package has zero dependencies now. To achieve this, I've fixed the docstrings to be pure with only unicode characters as opposed to LaTeX escaped strings. I've also refactored the constructors of some of the aggregation modes which were calling StatsBase.weights to convert vectors of numbers to weights objects. This converstion can happen in the client code instead.

Breaking

  • I've removed the names value_deriv and value_deriv! from the costs interface for now because none of the direct dependencies of the package are exploiting this minor optimization. To be honest, it is very unlikely that we will gain much speedup with this optimization in practice. It certaintly increases the maintenance burden.

  • As this module is intended for use by developers, it no longer exports the names. The packages that implement the concepts can export a subset of the names and that will be much easier to maintain. Developers can always create shortcuts like const LB = LearnBase and const ObsDim = LearnBase.ObsDim inside their modules to control code verbosity. It is important however that they quality the functions explicitly to avoid name clashes with other initiatives like StatsBase.jl for example.

Overall, I am very satisfied with this reorganization of the source, and feel ready to improve the downstream packages. In particular, after this PR is merged, I will come back to the other major refactoring in JuliaML/LossFunctions.jl#130 and rebase it to the cleaned API. Looking forward to improving the API further.

@juliohm
Copy link
Member Author

juliohm commented Apr 15, 2020

Update

I realized that the UnarySupervisedLoss type I had initially proposed is somewhat redundant. I've removed it in this latest commit, in which case the type hierarchy of losses remains unchanged.

@juliohm juliohm merged commit d49d646 into master Apr 15, 2020
@juliohm juliohm deleted the refactor branch April 15, 2020 13:41
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants